
Introduction

Welcome to my Data Science project: an A/B Testing Analysis for a website survey.

The project utilizes a dataset with features for more than 8,000 users of a website, together with their click-through response to two different versions of a survey. The results of the test are analyzed and evaluated through data exploration, sanity checks and statistical tests. Recommendations are provided on whether it is safe and worthwhile to launch the experimental version "B" of the survey. In addition, a new A/B test is proposed, and its required sample size is estimated based on the desired statistical power and minimum detectable effect.

I also invite you to visit my LinkedIn profile and to see my other projects on my GitHub profile.

Sincerely,

Michail Mavrogiannis


Data Overview

Dataset Import

Import libraries below:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
import statsmodels as sm
import statsmodels.api as sma

Import dataset into dataframe "dfr".

In [2]:
dfr = pd.read_csv('C:/Users/Michael/Desktop/Data Science- MM/ABTest/Data.csv')
dfr.head()
Out[2]:
auction_id experiment date hour device_make platform_os browser yes no
0 0008ef63-77a7-448b-bd1e-075f42c55e39 exposed 7/10/2020 8 Generic Smartphone 6 Chrome Mobile 0 0
1 000eabc5-17ce-4137-8efe-44734d914446 exposed 7/7/2020 10 Generic Smartphone 6 Chrome Mobile 0 0
2 0016d14a-ae18-4a02-a204-6ba53b52f2ed exposed 7/5/2020 2 E5823 6 Chrome Mobile WebView 0 1
3 00187412-2932-4542-a8ef-3633901c98d9 control 7/3/2020 15 Samsung SM-A705FN 6 Facebook 0 0
4 001a7785-d3fe-4e11-a344-c8735acacc2c control 7/3/2020 15 Generic Smartphone 6 Chrome Mobile 0 0

The experiment tests the Click-Through Probability for two versions of a website survey. The survey consists of a single yes/no question, to which users can respond through "radio" buttons. No information is provided about the format of the survey (popup, widget, or other), or about the content of the question. This project therefore focuses on the Click-Through Probability of each version of the survey, i.e. whether the users clicked on the survey at all. It does not focus on the probability of each of the possible responses (yes or no).

Description of Features:

  • auction_id: ID of the survey impression, unique per user.

  • experiment: Whether the user belongs to 'control' or 'exposed' (experiment) group.

  • date: Date of the survey impression, in M/D/YYYY format.

  • hour: Hour in HH format.

  • device_make: Make and model of the user device.

  • platform_os: Operating System of the user device platform, represented by a code.

  • browser: Browser on which the user sees the website and the survey.

  • yes: 1 if the user responded "yes" to the survey question through the "radio" buttons; 0 if the user either responded "no" or did not click on the survey at all*.

  • no: 1 if the user responded "no"; 0 if the user either responded "yes" or did not click on the survey at all*.

*If both columns 'yes' and 'no' are zero, it means that the user did not click on the survey at all.

In [3]:
dfr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8077 entries, 0 to 8076
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   auction_id   8077 non-null   object
 1   experiment   8077 non-null   object
 2   date         8077 non-null   object
 3   hour         8077 non-null   int64 
 4   device_make  8077 non-null   object
 5   platform_os  8077 non-null   int64 
 6   browser      8077 non-null   object
 7   yes          8077 non-null   int64 
 8   no           8077 non-null   int64 
dtypes: int64(4), object(5)
memory usage: 568.0+ KB

References

The dataset used as input in the present project was obtained from the following post on kaggle.com: https://www.kaggle.com/osuolaleemmanuel/ad-ab-testing?select=AdSmartABdata+-+AdSmartABdata.csv. I would like to thank the author of the post, Osuolale Emmanuel, for granting me permission to use the dataset.

Preprocessing

Column 'auction_id' is renamed to the more intuitive 'user_id' and 'platform_os' is renamed to 'operating_sys':

In [4]:
dfr.rename(columns = {'auction_id': 'user_id', 'platform_os': 'operating_sys'}, inplace = True)

The 'experiment' column currently contains the string values "exposed" and "control". These are mapped to 1 for the experiment group and 0 for the control group.

In [5]:
dfr['experiment'].unique()
Out[5]:
array(['exposed', 'control'], dtype=object)
In [6]:
dfr['experiment'] = dfr['experiment'].map({'exposed': 1, 'control': 0})

The 'date' column currently contains strings, which are converted into datetime objects here. New columns are also added for the day of the week and for the concatenated date and day of the week:

In [7]:
dfr['date'] = dfr['date'].apply(lambda x: pd.to_datetime(x))
In [8]:
dct1 = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
dfr['day'] = dfr['date'].apply(lambda x: x.dayofweek).map(dct1)
In [9]:
dfr['temp'] = dfr['date'].apply(lambda x: x.date()) 
dfr['date, day'] = dfr.apply(lambda x: str(x['temp']) + ', ' + str(x['day']), axis = 1)
dfr.drop('temp', axis = 1, inplace = True)

A new column is created to show whether the user clicked on the survey at all; 0: did not click, 1: clicked.

In [10]:
dfr['clicked'] = dfr.apply(lambda x: 1 if x['yes'] == 1 or x['no'] == 1 else 0, axis = 1)

Information on the content of the survey question is not available, so the "yes"/"no" responses cannot be interpreted in a way that is insightful for the A/B testing experiment. Columns 'yes' and 'no' are therefore removed:

In [11]:
dfr.drop(['yes', 'no'], axis = 1, inplace = True)
In [12]:
dfr = dfr[['user_id', 'experiment', 'date', 'day', 'date, day', 'hour', 'operating_sys', 'browser', 'device_make', 
        'clicked']]

New Features Description & Summary Statistics

Updated dataframe per the preprocessing above:

In [13]:
dfr.head()
Out[13]:
user_id experiment date day date, day hour operating_sys browser device_make clicked
0 0008ef63-77a7-448b-bd1e-075f42c55e39 1 2020-07-10 Friday 2020-07-10, Friday 8 6 Chrome Mobile Generic Smartphone 0
1 000eabc5-17ce-4137-8efe-44734d914446 1 2020-07-07 Tuesday 2020-07-07, Tuesday 10 6 Chrome Mobile Generic Smartphone 0
2 0016d14a-ae18-4a02-a204-6ba53b52f2ed 1 2020-07-05 Sunday 2020-07-05, Sunday 2 6 Chrome Mobile WebView E5823 1
3 00187412-2932-4542-a8ef-3633901c98d9 0 2020-07-03 Friday 2020-07-03, Friday 15 6 Facebook Samsung SM-A705FN 0
4 001a7785-d3fe-4e11-a344-c8735acacc2c 0 2020-07-03 Friday 2020-07-03, Friday 15 6 Chrome Mobile Generic Smartphone 0

Description of Updated Features:

  • user_id: ID of the survey impression, unique per user.

  • experiment: Whether the user belongs to control group: 0, or experiment group: 1.

  • date: Date in YYYY-MM-DD format.

  • day: Day of the week.

  • date, day: Concatenated date and day of the week.

  • hour: Hour in HH format.

  • operating_sys: Operating System of the user device platform, represented by a code.

  • browser: Browser on which the user sees the website and the survey.

  • device_make: Make and model of the user device.

  • clicked: Whether the user clicked: 1, or did not click: 0, on the survey.

In [14]:
dfr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8077 entries, 0 to 8076
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   user_id        8077 non-null   object        
 1   experiment     8077 non-null   int64         
 2   date           8077 non-null   datetime64[ns]
 3   day            8077 non-null   object        
 4   date, day      8077 non-null   object        
 5   hour           8077 non-null   int64         
 6   operating_sys  8077 non-null   int64         
 7   browser        8077 non-null   object        
 8   device_make    8077 non-null   object        
 9   clicked        8077 non-null   int64         
dtypes: datetime64[ns](1), int64(4), object(5)
memory usage: 631.1+ KB

No null values are seen so far. Further checks will be performed in the next sections.

In [15]:
dfr.describe()
Out[15]:
experiment hour operating_sys clicked
count 8077.000000 8077.000000 8077.000000 8077.000000
mean 0.495976 11.615080 5.947134 0.153894
std 0.500015 5.734879 0.224333 0.360869
min 0.000000 0.000000 5.000000 0.000000
25% 0.000000 7.000000 6.000000 0.000000
50% 0.000000 13.000000 6.000000 0.000000
75% 1.000000 15.000000 6.000000 0.000000
max 1.000000 23.000000 7.000000 1.000000

Sanity Checks & Data Exploration

After the completion of the A/B test, the first step is to perform sanity checks on the results. Specifically, in this section:

  • Check whether the total numbers of users in the control and experiment groups are comparable. This minimizes the pooled standard error (variance) used when applying the Central Limit Theorem, and it is a first good sign that the user-selection algorithm works properly.
  • Check whether the numbers of users in the control and experiment groups are comparable within each user segment ("slice"). If not, conclusions about the overall click-through probability difference between the two groups cannot be drawn, because of Simpson's paradox: a trend observed within individual data groups can even reverse when the groups are combined.
  • Sign tests and Confidence Intervals are used to check the significance of any observed difference between the control and experiment groups, per user segment (i.e. for the different values of a user feature).
  • The dataset at hand does not provide information on other website events that could be used for checking invariant metrics, i.e. metrics that should not vary significantly before and during the experiment, such as the total number of cookies visiting the website or the click-through probability of other buttons on the website.
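The Simpson's-paradox risk mentioned above is easy to demonstrate with a tiny, entirely hypothetical example (the counts below are invented purely for illustration): version B wins inside every slice, yet loses once the unevenly sized slices are pooled.

```python
# Hypothetical (clicks, users) counts, invented to illustrate Simpson's paradox:
# within each slice B beats A, but the pooled totals reverse the trend
# because users are distributed very unevenly across the slices.
slices = {
    'slice_1': {'A': (80, 100), 'B': (9, 10)},
    'slice_2': {'A': (2, 10),   'B': (30, 100)},
}

for name, grp in slices.items():
    rate_a = grp['A'][0] / grp['A'][1]
    rate_b = grp['B'][0] / grp['B'][1]
    print(f"{name}: A={rate_a:.2f}, B={rate_b:.2f}")   # B higher in both slices

total_a = sum(g['A'][0] for g in slices.values()) / sum(g['A'][1] for g in slices.values())
total_b = sum(g['B'][0] for g in slices.values()) / sum(g['B'][1] for g in slices.values())
print(f"pooled:  A={total_a:.2f}, B={total_b:.2f}")    # A higher overall
```

This is exactly why comparable group sizes are required within every slice before the overall click-through difference can be trusted.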

Click-Through Probability vs. Rate

In [16]:
len(dfr), dfr['user_id'].nunique()
Out[16]:
(8077, 8077)
  • As seen above, the users in the dataset are unique. Therefore the Click-Through Probability will be considered, instead of the Click-Through Rate.
  • Assuming that the actions of all users are independent of each other, the number of clicks follows a binomial distribution. Thus, according to the Central Limit Theorem, the sampling distribution of the sample click-through probabilities is approximately normal.
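As a quick back-of-the-envelope check of this normal approximation, the standard error of a sample click-through probability p̂ over n users is sqrt(p̂(1 − p̂)/n). A minimal sketch using the group sizes and click counts reported in the summary table of the next section (control: 586 of 4071, exposed: 657 of 4006):

```python
import math

# Normal-approximation standard error of a sample proportion p_hat over n users.
def proportion_se(p_hat, n):
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Group sizes and click counts from the summary table in the next section:
# control clicked 586 of 4071 users, exposed clicked 657 of 4006 users.
se_control = proportion_se(586 / 4071, 4071)
se_exposed = proportion_se(657 / 4006, 4006)
print(round(se_control, 4), round(se_exposed, 4))
```

Standard errors of roughly half a percentage point per group put the observed ~2-percentage-point difference within a few standard errors of zero, which is why formal tests are needed.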

Overall Click-Through Probability

In [17]:
test_summary = pd.DataFrame()
test_summary['Total # of Users'] = dfr.groupby('experiment').count()['user_id']
test_summary['Users Clicked'] = dfr[dfr['clicked'] == 1].groupby('experiment').count()[['user_id']]
test_summary['Click-Thru Prob.'] = (test_summary['Users Clicked'] / test_summary['Total # of Users']).round(3)
test_summary
Out[17]:
Total # of Users Users Clicked Click-Thru Prob.
experiment
0 4071 586 0.144
1 4006 657 0.164
In [18]:
sns.countplot(data = dfr, x = 'experiment')
plt.title('Number of users per group'); plt.ylabel('# of users'); plt.show()

The total numbers of users in the control and experiment groups are comparable, but not equal. Whether the difference is significant or could have occurred by chance will be checked in two ways (for the sake of reference): using a Confidence Interval and a Hypothesis Test.

  • Find the 95% Confidence Interval for the probability of a user being assigned to the control group. It is assumed that user assignment to the control group is a sequence of Bernoulli trials with a 50% probability of success.
In [19]:
ci = sma.stats.proportion_confint(nobs = len(dfr), count = len(dfr[dfr['experiment'] == 0]), alpha= 0.05, method = 'normal')
print('The Confidence Interval is {}.'.format([ci[0].round(3), ci[1].round(3)]))
The Confidence Interval is [0.493, 0.515].

The above confidence interval includes 0.5, thus the difference is not statistically significant.

  • Two-tailed Hypothesis test with a 5% significance level:
In [20]:
ht = sma.stats.proportions_ztest(nobs = len(dfr), count = len(dfr[dfr['experiment'] == 0]),  value = 0.5, 
                            alternative='two-sided', prop_var=False)
print('The Z-statistic is {} and its p-value is {}.'.format(ht[0].round(3), ht[1].round(3)))
The Z-statistic is 0.723 and its p-value is 0.47.

The p-value is greater than the significance level, therefore the null hypothesis (i.e. that population's probability of assignment to control group is 0.5) cannot be rejected.
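For reference, the statsmodels results above can be reproduced by hand from the normal approximation (a minimal sketch; with `prop_var=False`, `proportions_ztest` uses the sample proportion for the variance):

```python
import math

n = 8077   # total users
k = 4071   # users assigned to the control group
p_hat = k / n

# Standard error based on the sample proportion (normal approximation).
se = math.sqrt(p_hat * (1 - p_hat) / n)

# 95% confidence interval: p_hat +/- 1.96 * se
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se

# Z-statistic against the hypothesized assignment probability of 0.5.
z = (p_hat - 0.5) / se

print(round(ci_low, 3), round(ci_high, 3), round(z, 3))  # matches the results above
```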

In [21]:
sns.barplot(data = dfr, x = 'experiment', y = 'clicked', estimator = np.mean, ci = None, palette = 'tab20c_r')
plt.title('Click-Through Probability per group'); plt.ylabel('click-through probability'); plt.show()

The observed difference between the click-through probability of the control and experiment group will not be checked for significance yet. This is contingent upon the outcome of the sanity checks of the following sections.

Decision Tree for # of Users per Group

As mentioned previously, part of the sanity checks is to verify that comparable numbers of users from the control and experiment groups are assigned to each user "slice". User slices are defined by the various values of the dataset features. The following sections explore the features one by one; however, it is useful first to identify any features that stand out in determining whether a user belongs to the control or experiment group, as an indication of unequal user distribution within those features.

A decision tree is used below, with a "min_impurity_decrease" limit determined after trial-and-error. The training set of the model includes the predictors: hour, date (dummies), operating_sys, browser (dummies). The response variable is whether a user belongs to control or experiment group, i.e. column 'experiment'. No train/test set split is required here.

In [22]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
mdl = DecisionTreeClassifier(min_impurity_decrease = 0.01)
dfr_tree = pd.concat([dfr[['hour', 'operating_sys']], pd.get_dummies(dfr['date']), pd.get_dummies(dfr['browser'])], axis = 1)
mdl.fit(dfr_tree, dfr['experiment'])
plt.figure(figsize=(15,12)); plot_tree(mdl, max_depth = 20, fontsize = 18); plt.show()
In [23]:
dfr_tree.columns[[2, 13, 0]]
Out[23]:
Index([2020-07-03 00:00:00, 'Chrome Mobile WebView', 'hour'], dtype='object')

The features (or values of categorical features) that stand out are:

  • the date '2020-07-03',
  • browser 'Chrome Mobile WebView', and
  • feature 'hour'.

This information will be taken into account during the data exploration of the next sections.
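As a side note, instead of reading split features off the plotted tree by column index, the fitted tree exposes a `feature_importances_` attribute that ranks features directly. A minimal, self-contained sketch on synthetic data (the feature names and data here are invented stand-ins, not the notebook's dataframe):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the predictor matrix: three binary features,
# only the first of which is (noisily) related to group assignment.
X = rng.integers(0, 2, size=(1000, 3))
y = np.where(rng.random(1000) < 0.1, 1 - X[:, 0], X[:, 0])  # feature 0 + 10% noise

mdl = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X, y)

# Importances sum to 1; features never used in a split get importance 0.
for name, imp in zip(['feat_0', 'feat_1', 'feat_2'], mdl.feature_importances_):
    print(name, round(imp, 3))
```

On the real dataframe, pairing `mdl.feature_importances_` with `dfr_tree.columns` surfaces the same standout columns without manual index lookup.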

'date' & 'day'

In [24]:
plt.figure(figsize = (7,4))
sns.countplot(x = dfr['date, day'].sort_values(), color = 'lightblue')
plt.title('Total number of users per date'); plt.xlabel('date'); plt.xticks(rotation = 70); plt.ylabel('# of users')  
plt.show()

As seen in the histogram for total number of users per date:

  • The experiment was conducted for 8 days, from Friday 7/3 to Friday 7/10. This span is not representative of user behavior within a week; it also seems odd that the experiment ran for 8 days, versus the 7 or 14 days required to capture weekly seasonality.
  • Another odd fact is that the total number of users on the first Friday is almost double that on the second Friday, and generally much larger than on all the other days. This means either (a) that one of these two Fridays may be a special day (e.g. a holiday) and thus should not be included in the experiment, or (b) that the algorithm selecting users may have a bug and does not direct the same % of traffic through the control and experiment groups every day, leading to a non-representative sample of the website users.
  • Assuming that the same % of all website users is selected to participate in the experiment every day, there are fewer users on Saturday and Sunday than on Wednesday and Thursday. This may not be typical behavior, so in real conditions the data scientists should request more information from the website company.
  • The experiment span includes the 4th of July. It should be investigated whether the website users are located in the US, in which case this week should not be considered representative of the users' behavior.
In [25]:
plt.figure(figsize = (7,4))
sns.countplot(x = dfr['date, day'].sort_values(), hue = dfr['experiment'])
plt.title('Number of users per date & group'); plt.xlabel('date'); plt.xticks(rotation = 70); plt.ylabel('# of users')
plt.show()

As seen in the histogram for total number of users per date and group:

  • The decision tree's earlier finding is confirmed: date '2020-07-03' has a highly uneven distribution of users between the control and experiment groups!
  • Primarily Friday 7/3, but also the remaining days, show different numbers of users assigned to the control and experiment groups. For most days the differences are clearly statistically significant and cannot have occurred by chance. This means that:
  • (a) the collected user sample is not representative of all users of the website, because even if the %'s of control users were representative, the %'s of experiment users are different and thus would not be representative;
  • (b) it would not be safe to draw a conclusion about the overall click-through probability difference between the two groups, because Simpson's paradox may be present.
  • There must be a bug in the algorithm selecting users to participate in the experiment.
  • It was previously observed that the total numbers of users in the two groups are almost equal. Combined with the fact that the first Friday has vastly more users in control than in experiment, this may explain why the remaining days systematically have more users in experiment than in control: perhaps a mistake occurred on the first Friday and was identified, so on the following days an effort was made to balance out the total number of users in each group. This is not enough, however; equal numbers of users per group and per day are required to avoid Simpson's-paradox behavior.
In [26]:
plt.figure(figsize = (7,4))
sns.barplot(x = dfr['date, day'].sort_values(), y = dfr['clicked'], estimator = np.mean, 
            hue = dfr['experiment'], ci = None, palette = 'tab20c_r')
plt.title('Click-Through Probability per date & group'); plt.xlabel('date'); plt.xticks(rotation = 70) 
plt.ylabel('click-through probability'); plt.legend(loc = [1.01, 0.78], title = 'experiment'); plt.show()

For 7 of the 8 days of the test, the experiment group has a higher click-through probability than the control group. A sign test is performed below to check whether this behavior could have occurred by chance or indicates a significant trend. We assume the days are a sequence of Bernoulli trials whose binary outcome is whether the control or the experiment group has the higher click-through probability. Assuming a 50% success probability for either outcome, and a 5% significance level for a two-tailed test, the p-value of the observed behavior is found below:

In [27]:
print('The p-value for the sign test is {}.'.format(round(2 * (1 - stats.binom.cdf(k = 7-1, n = 8, p = 0.5)), 2)))
The p-value for the sign test is 0.07.
  • The p-value is greater than the significance level, so the fact that 7 out of 8 days show a higher click-through probability for the experiment group than for the control could have occurred by chance.
  • The uneven distribution of users between the two groups per day is still kept in mind as a caveat.
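The hand-computed two-tailed p-value above can be cross-checked with SciPy's exact binomial test (available as `stats.binomtest` in SciPy ≥ 1.7); with p = 0.5 the distribution is symmetric, so the exact two-sided p-value coincides with doubling the upper tail:

```python
from scipy import stats

# Exact two-sided binomial test: 7 "successes" (days on which the experiment
# group had the higher click-through probability) out of n = 8 days.
res = stats.binomtest(k=7, n=8, p=0.5, alternative='two-sided')
print(round(res.pvalue, 2))  # 0.07, matching the CDF-based calculation
```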

'hour'

In [28]:
plt.figure(figsize = (24,11))
i = 1
for j in dfr['date, day'].sort_values().unique():
    plt.subplot(2, 4, i)
    plt.hist(dfr[dfr['date, day'] == j]['hour'], bins = np.arange(25)-0.5, color = 'Lightblue')
    i = i + 1
    
    plt.title('{}'.format(j), fontsize = 14); plt.xticks(range(0, 24)); plt.xlabel('hour', fontsize = 14)
    plt.ylabel('Total # of Users', fontsize = 14); plt.ylim(0, 150)

As seen in the histogram of total number of users per date and hour:

  • There is an unusual spike in the number of users on Friday 07/03, at 15:00. It is further investigated in the next graph, which also "slices" the users per group.
  • On Friday 07/03, Monday 07/06 and Friday 07/10 there is a sudden drop in the total number of users past a certain hour. Although Friday evening plans can reduce site traffic, it seems odd that the reduction is so drastic and that there is almost no random traffic at all past that point. Wednesday shows a high plateau of users after 13:00.
  • In a real project, more information should be requested from the website company to investigate the reason for this behavior, and to check whether the distributions of the selected users are representative of the overall website traffic or the selection algorithm has a bug.
  • It would also be good to have data from a second full week of running the experiment, as a point of reference for the pattern of user behavior per day and within the day.
In [29]:
plt.figure(figsize = (24,11))
i = 1
for j in dfr['date, day'].sort_values().unique():
    plt.subplot(2, 4, i)
    plt.hist([dfr[(dfr['date, day'] == j) & (dfr['experiment'] == 0)]['hour'], 
    dfr[(dfr['date, day'] == j) & (dfr['experiment'] == 1)]['hour']], 
    bins = np.arange(25)-0.5, label = [0, 1], color = ['steelblue', 'darkorange'], stacked = False)
    i = i + 1
    
    plt.title('{}'.format(j), fontsize = 14); plt.xticks(range(0, 24)); plt.xlabel('hour', fontsize = 14)
    plt.ylabel('# of Users', fontsize = 14); plt.ylim(0, 150); plt.legend(title = 'experiment', fontsize = 14)

Plot the first subplot enlarged below, i.e. number of users per group for Friday 2020-07-03:

In [30]:
plt.figure(figsize = (5,5))
sns.countplot(x = dfr[dfr['date'] == '2020-07-03']['hour'], hue = dfr[dfr['date'] == '2020-07-03']['experiment'],
              order = range(24))

plt.title('2020-07-03, Friday'); plt.xticks(range(0, 24)); plt.ylabel('# of Users')
plt.legend(loc = 'upper left', title = 'experiment'); plt.show()

As seen on the histograms of number of users per group and date/hour:

  • The decision tree's earlier finding is confirmed: feature 'hour' has a highly uneven distribution of users between the control and experiment groups!

  • There must have been a bug in the selection algorithm, causing the enormous spike of control users on Friday 07/03 at 15:00. On that day, control users were selected only during the 15:00 hour, and they outnumber all of the day's experiment users several times over; in the remaining hours, only experiment users were selected. Such an uneven distribution within the day and per hour can cause Simpson's-paradox behavior when analyzing the overall results of the experiment.

  • Except for Friday 07/03, on all other days the experiment users are systematically and considerably more numerous than the control users in each hour. Dividing users exactly evenly between the two groups may be challenging, given that an even distribution is needed for every user "slice" (day, hour, browser, etc.), so small divergences are expected; here, however, there is a clear trend that cannot have occurred by chance.

In [31]:
plt.figure(figsize = (22,10))
i = 1
for j in dfr['date, day'].sort_values().unique():
    plt.subplot(2, 4, i)
    sns.barplot(x = dfr[dfr['date, day'] == j]['hour'], y = dfr[dfr['date, day'] == j]['clicked'],
            hue = dfr[dfr['date, day'] == j]['experiment'], 
            estimator = np.mean, ci = None, palette = 'tab20c_r', order = range(24))
    i = i + 1
    
    plt.title('{}'.format(j), fontsize = 14); plt.legend(title = 'experiment', fontsize = 14, loc = 'upper left'); 
    plt.xticks(range(0, 24)); plt.xlabel('hour', fontsize = 14)
    plt.ylabel('Click-Through Probability', fontsize = 14); plt.ylim(0, 1.1)

As seen from the histograms of click-through probability per group and date/hour:

  • No clear trend is identified.
  • In any case, the above histograms could not be insightful, given the highly uneven distribution of users between the control and experiment groups per date and hour, as analyzed in the previous plots. For example, on Friday 07/03, for all hours before 14:00, experiment users show high click-through probabilities and control users show zero probabilities, simply because there were no control users at all during those hours.
  • As mentioned above, there must be a problem with the algorithm assigning users to each group, and these uneven distributions of users can cause Simpson's-paradox behavior in the overall results of the test.

'operating_sys'

In [32]:
dfr['operating_sys'].value_counts()
Out[32]:
6    7648
5     428
7       1
Name: operating_sys, dtype: int64

The platform operating-system column contains the codified values 5, 6 and 7, of which 5 and 6 dominate. In the following we need to:

  • find which specific operating systems correspond to codes 5, 6, 7, and
  • figure out whether the operating systems pertain to mobile or computer platforms.

For the above, information from column 'device_make' will also be utilized.

In [33]:
dfr['device_make'].value_counts()
Out[33]:
Generic Smartphone    4743
iPhone                 433
Samsung SM-G960F       203
Samsung SM-G973F       154
Samsung SM-G950F       148
                      ... 
Samsung SM-A205F         1
Samsung SM-C9000         1
XiaoMi Redmi 6           1
HTC U12+                 1
Samsung SM-G973U         1
Name: device_make, Length: 269, dtype: int64

Among the most frequent (specific) cellphones in the dataset are the iPhone, the Samsung SM-G960F and the Samsung SM-G950F. The iPhone is known to run iOS, while the two Samsung phones run Android. Below we check which operating-system codes these phones carry in the dataset:

In [34]:
top_3 = dfr[dfr['device_make'].apply(lambda x: x in ['iPhone', 'Samsung SM-G960F', 'Samsung SM-G950F'])]
top_3.groupby(['device_make', 'operating_sys']).count()[['user_id']]
Out[34]:
user_id
device_make operating_sys
Samsung SM-G950F 6 148
Samsung SM-G960F 6 203
iPhone 5 428
6 5

Apparently, operating-system code '5' corresponds to iOS and '6' corresponds to Android. Five iPhones are misclassified as having Android, which is corrected below:

In [35]:
# Vectorized assignment avoids the chained-indexing pattern that raises
# a SettingWithCopyWarning when looping over rows with .loc[i].
dfr.loc[dfr['device_make'] == 'iPhone', 'operating_sys'] = 5

Check what operating system '7' is:

In [36]:
dfr[dfr['operating_sys'] == 7]
Out[36]:
user_id experiment date day date, day hour operating_sys browser device_make clicked
2332 4c4332e4-25ce-483b-a565-76a76a802ca6 1 2020-07-03 Friday 2020-07-03, Friday 13 7 Edge Mobile Lumia 950 0

The Lumia 950 runs Microsoft Windows, which corresponds to code '7'. At this point we conclude that all devices in the dataset run Android, iOS or Microsoft Windows, and are therefore mobile devices.

The operating system codes are renamed below based on the information found:

In [37]:
dfr['operating_sys'] = dfr['operating_sys'].map({5: 'IOS', 6: 'Android', 7: 'Windows'})
In [38]:
sns.countplot(data = dfr, x = 'operating_sys', color = 'lightblue')
plt.title('Total number of users per operating_sys'); plt.ylabel('# of users'); plt.show()

As seen in the histogram above:

  • The percentage of users with IOS is very low compared to users with Android. According to online statistical resources, the market share of iPhones is close to 45% in the US and close to 27% worldwide. Thus, the dataset was most likely not collected in the US.
In [39]:
sns.countplot(data = dfr, x = 'operating_sys', hue = 'experiment')
plt.title('Number of users per operating_sys & group'); plt.legend(loc = 'upper right', title = 'experiment')
plt.ylabel('# of users'); plt.show()

For Android, check whether the difference in the number of users between the control and experiment groups could have occurred by chance, or whether it is statistically significant:

In [40]:
ci = sma.stats.proportion_confint(nobs = len(dfr[dfr['operating_sys'] == 'Android']), 
                                  count = len(dfr[(dfr['operating_sys'] == 'Android') & (dfr['experiment'] == 0)]), 
                                  alpha= 0.05, method = 'normal')
print('The Confidence Interval is {}.'.format([ci[0].round(3), ci[1].round(3)]))
The Confidence Interval is [0.481, 0.503].

For IOS, check whether the difference in the number of users between the control and experiment groups could have occurred by chance, or whether it is statistically significant:

In [41]:
ci = sma.stats.proportion_confint(nobs = len(dfr[dfr['operating_sys'] == 'IOS']), 
                                  count = len(dfr[(dfr['operating_sys'] == 'IOS') & (dfr['experiment'] == 0)]), 
                                  alpha= 0.05, method = 'normal')
print('The Confidence Interval is {}.'.format([ci[0].round(3), ci[1].round(3)]))
The Confidence Interval is [0.676, 0.761].

As seen above:

  • The 95% Confidence Interval for the proportion of control users among Android users includes 0.50. Thus the difference between the two groups could have occurred by chance.
  • The 95% Confidence Interval for the proportion of control users among IOS users does not include 0.50, so the difference between the groups is significant. It can trigger Simpson's paradox in the overall test results.
In [42]:
sns.barplot(data = dfr, x = 'operating_sys', y = 'clicked', hue = 'experiment', ci = None, palette = 'tab20c_r')
plt.title('Click-Through Probability per operating_sys & group'); plt.ylabel('click-through probability') 
plt.legend(loc = 'upper right', title = 'experiment'); plt.show()

For Android, check whether the click-through probability difference between the control and experiment groups could have occurred by chance, or whether it is statistically significant:

In [43]:
ci = sm.stats.proportion.confint_proportions_2indep(
    nobs1 = len(dfr[(dfr['operating_sys'] == 'Android') & (dfr['experiment'] == 0)]), 
    nobs2 = len(dfr[(dfr['operating_sys'] == 'Android') & (dfr['experiment'] == 1)]), 
    count1 = len(dfr[(dfr['operating_sys'] == 'Android') & (dfr['experiment'] == 0) & (dfr['clicked'] == 1)]), 
    count2 = len(dfr[(dfr['operating_sys'] == 'Android') & (dfr['experiment'] == 1) & (dfr['clicked'] == 1)]), 
    method = 'wald', compare='diff', alpha= 0.05, correction=False)

print('The Confidence Interval is {}.'.format([ci[0].round(3), ci[1].round(3)]))
The Confidence Interval is [-0.032, 0.0].

As seen above:

  • Android users are the majority of users, and in the control group they have a lower click-through probability than in the experiment group. The difference is not statistically significant, as the 95% Confidence Interval of the difference only just includes 0.
  • Although Android users are distributed almost equally between the control and experiment groups, the uneven distribution of users across the two groups per other features is still kept in mind as a caveat.
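For reference, the Wald interval that `confint_proportions_2indep(..., method='wald')` returns above is just the difference of the two sample proportions plus/minus 1.96 standard errors. A minimal sketch with illustrative, made-up counts (the real Android counts are derived from the dataframe in the cell above and are not repeated here):

```python
import math

# Wald 95% CI for the difference of two independent proportions:
# (p1 - p2) +/- 1.96 * sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
def wald_diff_ci(count1, nobs1, count2, nobs2, z=1.96):
    p1, p2 = count1 / nobs1, count2 / nobs2
    se = math.sqrt(p1 * (1 - p1) / nobs1 + p2 * (1 - p2) / nobs2)
    return p1 - p2 - z * se, p1 - p2 + z * se

# Hypothetical counts, for illustration only.
low, high = wald_diff_ci(count1=150, nobs1=1000, count2=180, nobs2=1000)
print(round(low, 3), round(high, 3))
```

If the resulting interval contains 0, the difference between the two proportions is not significant at the 5% level.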

'browser'

In [44]:
dfr['browser'].sort_values().unique()
Out[44]:
array(['Android', 'Chrome', 'Chrome Mobile', 'Chrome Mobile WebView',
       'Chrome Mobile iOS', 'Edge Mobile', 'Facebook', 'Firefox Mobile',
       'Mobile Safari', 'Mobile Safari UI/WKWebView', 'Opera Mini',
       'Opera Mobile', 'Pinterest', 'Puffin', 'Samsung Internet'],
      dtype=object)

As seen in the 'operating_sys' section, all devices are mobile. The browser names above can therefore be simplified; for example, there is no need to keep separate categories named 'Chrome' and 'Chrome Mobile', which can be merged into a single category, 'Chrome'. This is done below for all applicable browser names:

In [45]:
dct_browser = {'Android': 'Android', 'Chrome': 'Chrome', 'Chrome Mobile': 'Chrome', 'Chrome Mobile WebView': 'Chrome',
       'Chrome Mobile iOS': 'Chrome', 'Edge Mobile': 'Edge', 'Facebook': 'Facebook', 'Firefox Mobile': 'FireFox',
       'Mobile Safari': 'Safari', 'Mobile Safari UI/WKWebView': 'Safari', 'Opera Mini': 'Opera',
       'Opera Mobile': 'Opera', 'Pinterest': 'Pinterest', 'Puffin': 'Puffin', 'Samsung Internet': 'Samsung'}
dfr['browser'] = dfr['browser'].map(dct_browser)
In [46]:
sns.countplot(dfr['browser'], color = 'lightblue')
plt.title('Total number of users per browser'); plt.ylabel('# of users'); plt.xticks(rotation = 60); plt.show()

As seen above:

  • The distribution of users across browser categories is in general agreement with the browsers' worldwide market share.
In [47]:
sns.countplot(data = dfr, x = 'browser', hue = 'experiment')
plt.title('Number of users per browser & group'); plt.xticks(rotation = 60); plt.ylabel('# of users')
plt.legend(loc = 'upper right', title = 'experiment'); plt.show()

For Chrome, check whether the uneven split of users between the control and experiment groups could have occurred by chance, or whether it is statistically significant:

In [48]:
ci = sma.stats.proportion_confint(nobs = len(dfr[dfr['browser'] == 'Chrome']), 
                                  count = len(dfr[(dfr['browser'] == 'Chrome') & (dfr['experiment'] == 0)]), 
                                  alpha= 0.05, method = 'normal')
print('The Confidence Interval is {}.'.format([ci[0].round(3), ci[1].round(3)]))
The Confidence Interval is [0.436, 0.461].

As seen above,

  • The previously used decision tree was verified: feature 'browser' indeed has a highly uneven distribution of users between the control and experiment groups! (The decision tree specifically flagged browser type 'Chrome Mobile WebView', which was merged here with the other Chrome-named browsers.)

  • Chrome is the browser of the majority of users. In addition, the number of Chrome users assigned to the control group is smaller than that assigned to the experiment group, and this difference is statistically significant. This suggests a bug in the selection algorithm, and it means we cannot draw a conclusion about the overall difference in click-through probability between the two groups, due to Simpson's paradox.

  • Given the high number of different user "slices", it may be challenging for the selection system to assign the same number of users per slice to each group. For browsers, however, this should be done at least for the most frequent ones in the dataset: Chrome, Facebook, Safari, and Samsung.
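Beyond checking one browser at a time, the group assignment can be tested across all browsers at once with a chi-square test of independence on the browser-by-group contingency table. A minimal sketch with scipy and illustrative placeholder counts (in the notebook, the table would be built with `pd.crosstab(dfr['browser'], dfr['experiment'])`):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Illustrative placeholder contingency table; in the notebook:
# table = pd.crosstab(dfr['browser'], dfr['experiment'])
table = pd.DataFrame(
    {'control':    [1800, 700, 500, 400],
     'experiment': [2100, 690, 510, 390]},
    index=['Chrome', 'Facebook', 'Safari', 'Samsung'])

# Test whether browser and group assignment are independent
chi2, p_value, dof, expected = chi2_contingency(table)
print('chi2 = {:.2f}, dof = {}, p-value = {:.4f}'.format(chi2, dof, p_value))
```

A small p-value indicates the browser distribution differs between the two groups, i.e. the assignment is not independent of browser.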

In [49]:
sns.barplot(data = dfr, x = 'browser', y = 'clicked', estimator = np.mean, hue = 'experiment', 
            ci = None, palette = 'tab20c_r')
plt.xticks(rotation = 60); plt.ylabel('click-through probability'); plt.legend(title = 'experiment', loc = 'upper right')
plt.title('Click-Through Probability per browser & group'); plt.show()

For Chrome (the browser of the majority of users), check whether the click-through probability difference between the control and experiment groups could have occurred by chance, or whether it is statistically significant:

In [50]:
ci = sm.stats.proportion.confint_proportions_2indep(
    nobs1 = len(dfr[(dfr['browser'] == 'Chrome') & (dfr['experiment'] == 0)]), 
    nobs2 = len(dfr[(dfr['browser'] == 'Chrome') & (dfr['experiment'] == 1)]), 
    count1 = len(dfr[(dfr['browser'] == 'Chrome') & (dfr['experiment'] == 0) & (dfr['clicked'] == 1)]), 
    count2 = len(dfr[(dfr['browser'] == 'Chrome') & (dfr['experiment'] == 1) & (dfr['clicked'] == 1)]), 
    method = 'wald', compare='diff', alpha= 0.05, correction=False)

print('The Confidence Interval is {}.'.format([ci[0].round(3), ci[1].round(3)]))
The Confidence Interval is [-0.046, -0.01].

As seen above:

  • Chrome users are the majority in the dataset, but they are not distributed equally between the control and experiment groups.
  • Chrome users in the control group have a lower click-through probability than those in the experiment group. This difference is statistically significant, as the 95% Confidence Interval of the difference does not include 0.

'device_make'

In [51]:
dfr['device_make'].nunique() 
Out[51]:
269
In [52]:
print(dfr['device_make'].value_counts()[0:15], 
'\n\nThe above devices constitute {}% of all user devices in the dataset.'.\
      format(100 * round(dfr['device_make'].value_counts()[0:15].sum() / len(dfr), 3)))
Generic Smartphone     4743
iPhone                  433
Samsung SM-G960F        203
Samsung SM-G973F        154
Samsung SM-G950F        148
Samsung SM-G930F        100
Samsung SM-G975F         97
Samsung SM-A202F         88
Samsung SM-A405FN        87
Samsung SM-J330FN        69
Samsung SM-A105FN        66
Samsung SM-G965F         66
Nokia$2$3                64
Samsung SM-G935F         63
Nokia undefined$2$3      60
Name: device_make, dtype: int64 

The above devices constitute 79.7% of all user devices in the dataset.

As expected, there are many different types of devices. The 15 most frequent devices are shown above; cumulatively they constitute almost 80% of all devices in the dataset. Regarding the device make/model information:

  • It is not strictly needed for the sanity checks of the A/B testing. The algorithm assigning users to the control and experiment groups is not expected to assign the same number of users per group to each individual device type; that would be excessive. Assigning the same number of users per group within the other user "slices", such as day, hour, and browser, would be sufficient.

  • For the sake of data exploration, and to get an idea of the mobile manufacturers' market shares, we could group users into fewer categories based solely on device make, not model. However, there are many missing data points, i.e. the 'Generic Smartphone' entries, which constitute more than 50% of the dataset.

  • Through online data scraping, key device features could be added to the dataset, such as screen size, aspect ratio, and resolution. This would help in understanding how these features affect click-through probability, e.g. using a Machine Learning model. However, the missing device entries are too many in this dataset.
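The make-only categorization mentioned above can be sketched with a simple heuristic: take the first whitespace-separated token of the device string as the make. This is an assumption for illustration (it would mislabel multi-word makes and leaves 'Generic' for the missing entries), shown here on placeholder device strings:

```python
import pandas as pd

# Illustrative placeholder device strings (standing in for dfr['device_make']):
devices = pd.Series(['Generic Smartphone', 'iPhone', 'Samsung SM-G960F',
                     'Nokia 2.3', 'Samsung SM-A405FN'])

# Assumed heuristic: the make is the first whitespace-separated token
makes = devices.str.split().str[0]
print(makes.value_counts())
```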

Conclusions & Recommendations

A summary of the observations made in the previous sections is shown below:

  • Within each user segment ("slice"), a different number of users is assigned to the control and experiment groups. This points to a bug in the user selection algorithm. It can trigger Simpson's paradox, whereby a trend observed in individual groups of data disappears or even reverses when the groups are combined. It also means that the user sample of the test is not representative of the website's user population.
  • Within certain user segments (e.g. hours), the distribution of the total number of users (control plus experiment) is highly unlikely.
  • The test lasted for 8 days, which is not representative of the weekly trend of website users' behavior. It would be better to conduct the test for 7 or 14 days instead.
  • No information is provided about the business practical significance level (or minimum detectable effect), for the website company.

We conclude that it would NOT be safe to proceed with launching the experimental version "B" of the website survey.


More specifically, the per-feature analyses of the data showed the following:

  • The total numbers of users in the control and experiment groups were comparable, as seen here; however, this is not sufficient (see the points below).
  • 'date': The experiment lasted for 8 days, which is not representative of the weekly traffic on the website. The distribution of total users per day seems odd, e.g. Friday 07/03 and Friday 07/10 have very different numbers of total users. In addition, the numbers of users in the control and experiment groups are not equal per day, with Friday 07/03 presenting a spike in control users.
  • 'hour': The distribution of total users per day and hour seems highly unlikely, e.g. on Monday 07/06 there are almost zero users past 10am. The numbers of users in the control and experiment groups are not equal per day and hour, with 15:00 on Friday 07/03 presenting a spike in control users.
  • 'operating_sys': There is an uneven number of users between the control and experiment groups for the iOS operating system.
  • 'browser': There is an uneven number of users between control and experiment groups for Chrome browser, which is the browser of the majority of the users.
  • 'device_make': All devices are mobile phones. The make and model of the user phones are not necessarily needed for analyzing the A/B test results. More than 50% of the phone types are missing and referred to as 'Generic Smartphone'.
  • A decision tree model was used to identify features with highly uneven distribution of users between control and experiment groups. According to the tree, these feature values were: 'Friday 07/03', hour '15:00' and 'Chrome' browser, and they were verified by the detailed analyses per feature!
  • Based on the above, it is not safe to make recommendations or to draw conclusions about the click-through probability difference, even per user segment. This is because even when control and experiment users are equally distributed within a specific segment of interest, those same users are unevenly distributed across other segments.

Recommendations to fix the problem:

  • Run a new A/B testing to check the click-through probability of the two versions of the website survey.
  • Ensure that the total number of users in control and experiment groups are comparable.
  • Ensure comparable numbers of users assigned to control and experiment group within each user segment, i.e. per date, hour, operating system, and browser. This is to avoid Simpson's paradox behavior. Verify that the user selection algorithm of the website works properly.
  • The sample of users for the testing shall be representative of the entire traffic of the website. For instance, if almost 80% of all the website users have Android, this should be the percentage of users with Android within the testing sample as well.
  • Set a business significance level (minimum detectable effect), beyond which it is financially worthwhile to launch an experimental version.
  • Determine the required sample size for the testing based on the desired business significance level and statistical power. This process is presented in the next section for reference.
  • Determine the experiment duration based on the required sample size, the total traffic of the website, and the percentage of traffic that the website is comfortable to run through the experiment. Select the duration as a multiple of one week (7, 14 days etc.) to capture weekly trends of the user behavior.
  • Add invariant metrics which should not change before or during the experiment, to further check that the system works as designed. These could be, for example, the total number of cookies visiting the website, or the click-through probability of a button which appears on the webpage before the survey of interest.
  • The results of the testing shall be analyzed, and invariant and sanity metrics shall be checked. The statistical and business significance of any observed difference between the two groups shall be obtained. Beyond these outcomes, other factors also play a role in whether an experimental version should be launched, such as past experience and business intuition. Please note that all the above steps are described here only in a synoptic way.
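The stratified assignment recommended above can be sketched in a simplified form: shuffle the users, then alternate control/experiment within each segment, so that every segment splits as evenly as possible. This is an illustration with a single hypothetical segment column (`browser`); a real system would stratify jointly on date, hour, operating system, and browser:

```python
import pandas as pd

# Illustrative placeholder user table (standing in for the real user stream)
users = pd.DataFrame({'user_id': range(12),
                      'browser': ['Chrome'] * 6 + ['Safari'] * 4 + ['Opera'] * 2})

def stratified_assign(df, segment_col, seed=0):
    """Shuffle rows, then alternate 0/1 within each segment so that every
    segment splits as evenly as possible between control (0) and experiment (1)."""
    df = df.sample(frac=1, random_state=seed).copy()      # shuffle users
    df['experiment'] = df.groupby(segment_col).cumcount() % 2
    return df

assigned = stratified_assign(users, 'browser')
print(assigned.groupby(['browser', 'experiment']).size())
```

With this scheme, each browser segment ends up with (at most one user short of) a 50/50 split, which avoids the uneven per-segment assignment seen in this dataset.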

Sample Size Calculation for New Testing

This section shows the process of estimating the required sample size for a new testing.
The first step is to calculate the baseline click-through probability, i.e. the mean click-through probability of the website survey observed so far, for the "control" version of the survey. Just an estimate of this can be obtained through the available data. Keeping in mind the issues referred to in the previous section as a caveat, and taking out the first day of the experiment (which has unusual spikes in control user numbers), the baseline click-through probability is estimated as follows:

In [53]:
print('The baseline click-through probability of the control group is {}.'.\
      format(round(dfr[(dfr['date'] != '2020-07-03') & (dfr['experiment'] == 0)].mean()['clicked'], 3)))
The baseline click-through probability of the control group is 0.14.

However, this is the click-through probability of the control group in the testing sample, not of the population of all website users. The 95% Confidence Interval for the population probability is found below:

In [54]:
ci = sma.stats.proportion_confint(nobs = len(dfr[(dfr['date'] != '2020-07-03') & (dfr['experiment'] == 0)]), 
                    count = len(dfr[(dfr['date'] != '2020-07-03') & (dfr['experiment'] == 0) & (dfr['clicked'] == 1)]), 
                    alpha = 0.05, 
                    method = 'normal')
print('The Confidence Interval is {}.'.format([ci[0].round(3), ci[1].round(3)]))
The Confidence Interval is [0.126, 0.153].

It is known that the Standard Error of a proportion increases as the click-through probability increases, for probabilities <= 0.50. To be conservative, the highest plausible baseline probability (the upper bound of the Confidence Interval above) will be selected to size the experiment.
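This can be verified numerically: for a fixed sample size n, the binomial standard error sqrt(p(1-p)/n) grows as p approaches 0.5. A quick check with an illustrative n and the probabilities of interest (including the CI bounds 0.126 and 0.153 found above):

```python
import numpy as np

n = 1000  # illustrative sample size
probs = [0.10, 0.126, 0.153, 0.30, 0.50]

# Binomial standard error of a proportion: sqrt(p * (1 - p) / n)
ses = [np.sqrt(p * (1 - p) / n) for p in probs]
for p, se in zip(probs, ses):
    print('p = {:.3f} -> SE = {:.4f}'.format(p, se))
```

The standard errors increase monotonically up to p = 0.50, which is why using the CI's upper bound (0.153) as the baseline is the conservative choice.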

The selected statistical power (sometimes referred to as "sensitivity") of the experiment is 80%, and a business significance level (minimum detectable effect) of 0.02 is assumed:

In [55]:
target_power = 0.8
min_det_effect = 0.02
significance = 0.05
baseline_ctr = ci[1]     # conservative: upper bound of the baseline CI

# Increase the sample size one user at a time until the target power is reached:
power = 0; users_per_group = 0

while power < target_power:
    users_per_group = users_per_group + 1
    power = sm.stats.power.normal_power_het(diff = min_det_effect,       
                                nobs = users_per_group,     
                                alpha = significance,
                                std_null = np.sqrt(2 * baseline_ctr * (1 - baseline_ctr)), 
                                std_alternative = np.sqrt(baseline_ctr * (1 - baseline_ctr) + 
                                           (baseline_ctr + min_det_effect) * (1 - (baseline_ctr + min_det_effect))),       
                                alternative='two-sided') 
  
print('{} users are needed for EACH group, to achieve a sensitivity of {}.'.format(users_per_group, round(power, 2)))
5172 users are needed for EACH group, to achieve a sensitivity of 0.8.
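The loop above can be cross-checked against statsmodels' built-in power solver, which works with Cohen's h (an arcsine-transform effect size for two proportions). Since this effect-size definition differs slightly from the pooled-variance formulation used in the loop, the result is expected to be close to, but not exactly equal to, the loop's answer:

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.153   # upper bound of the baseline CI found above
mde = 0.02         # assumed minimum detectable effect

# Cohen's h effect size for the two proportions (arcsine transform)
effect = proportion_effectsize(baseline + mde, baseline)

# Solve for the per-group sample size at alpha = 0.05 and power = 0.8
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.8, ratio=1.0,
                                 alternative='two-sided')
print('Approximately {} users per group.'.format(int(np.ceil(n))))
```

Both approaches land in the same ballpark of a few thousand users per group, which supports the sizing obtained above.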


---------------------------------------- / END OF NOTEBOOK, THANK YOU! / ---------------------------------------- © 2021 Michail Mavrogiannis

You are welcome to visit My LinkedIn profile and see my other projects in My GitHub profile!

Michail Mavrogiannis